STRASSEN'S MATRIX MULTIPLICATION ALGORITHM FOR 
MATRICES OF ARBITRARY ORDER 

(3 i IVO HEDTKE 

^^' Abstract. The well known algorithm of Volker Strassen for matrix mul- 

^ ' tiplication can only be used for (m2* X m2^) matrices. For arbitrary (n X n) 

matrices one has to add zero rows and columns to the given matrices to use 
Strassen's algorithm. Strassen gave a strategy of how to set m and k for 
arbitrary n to ensure n < ml . In this paper we study the number d of ad- 
ditional zero rows and columns and the influence on the number of flops used 
by the algorithm in the worst case (d = n/16), best case {d = 1) and in the 
average case (d si n/48). The aim of this work is to give a detailed analysis 
of the number of additional zero rows and columns and the additional work 
caused by Strassen's bad parameters. Strassen used the parameters m and 
k to show that his matrix multiplication algorithm needs less than 4.7n'°S2 "^ 
r^ . , flops. We can show in this paper, that these parameters cause an additional 

work of approximately 20 % in the worst case in comparison to the optimal 
strategy for the worst case. This is the main reason for the search for better 
parameters. 
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1. Introduction 



In his paper ''Gaussian Elimination is not OptimaF ([H]) VOLKER STRASSEN 
*^ . developed a recursive algorithm (we will call it S) for multiplication of square 

I • I matrices of order m2'^. The algorithm itself is described below. Further details can 

^— V ■ be found in [8,, p. 31]. 

(•~^ ' Before we start with our analysis of the parameters of Strassen's algorithm 

we will have a short look on the history of fast matrix multiplication. The naive 
algorithm for matrix multiplication is an 0{n^) algorithm. In 1969 Strassen 
showed that there is an 0{n'^-^^) algorithm for this problem. Shmuel Winograd 
optimized Strassen's algorithm. While the Strassen- Winograd algorithm is a 
t^ . variant that is always implemented (for example in the famous GEMMW package) , 

there are faster ones (in theory) that are impractical to implement. The fastest 
known algorithm, devised in 1987 by Don Coppersmith and Winograd, runs in 
0{n'^'^^) time. There is also an interesting group-theoretic approach to fast matrix 
multiphcation from Henry Cohn and Christopher Umans, see [4], [^ and [13]. 
Most researchers believe that an optimal algorithm with 0{n^) runtime exists, since 
then no further progress was made in finding one. 

Because modern architectures have complex memory hierarchies and increasing 
parallelism, performance has become a complex tradeoff, not just a simple matter 
of counting flops (in this article one flop means one floating-point operation, that 
means one addition is a fiop and one multiplication is one flop, too). Algorithms 
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which make use of this technology were described by Paolo D' Alberto and 
Alexandru Nicolau in [1]. An also well known method is Tiling: The normal 
algorithm can be speeded up by a factor of two by using a six loop implementation 
that blocks submatrices so that the data passes through the LI Cache only once. 

1.1. The algorithm. Let A and B be {m2^ xm2^) matrices. To compute C :— AB 
let 



All 

A21 



A12 
A22 



and B = 



Bii 
B21 



B12 
B22 



where Aij 
matrices 

Hi 
H3 
H5 
Hj 

we get 



and Bij are matrices of order ■m2'' ^. With the following auxiliary 
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This leads to recursive computation. In the last step of the recursion the products 
of the {m X m) matrices are computed with the naive algorithm (straight forward 
implementation with three for- loops, we will call it A/"). 

1.2. Properties of the algorithm. The algorithm S needs (see [T3]) 

Fs{m, k) := 7'=m2(2m + 5) - i'^Gm^ 

flops to compute the product of two square matrices of order m2'^. The naive 



algorithm A^ needs n^ multiplications and n^ - 
of two (n X n) matrices. Therefore Fj\f{n) - 



i^ additions to compute the product 



2n-^ 



In the case n = 2^ the 



algorithm S is better than J\f if Fs{l,p) < Fj\f{2P), which is the case iff p ^ 10. 
But if we use algorithm S only for matrices of order at least 2^° — 1024, we get a 
new problem: 

Lemma. The algorithm S needs ^^{7^ — 4^) units of memory (we write "uom" 
in short) (number of floats or doubles) to compute the product of two (2^ x 2^) 
matrices. 

Proof. Let M{n) be the number of uom used by S to compute the product of 
matrices of order n. The matrices Aij, Bij and Hi need 15(r7,/2)^ uom. During the 
computation of the auxiliary matrices Hi we need 7M(n/2) uom and 2(n/2)^ uom 
as input arguments for the recursive calls of <S. Therefore we get M{n) = 7M(n/2) + 
(17/4)^2. Together with M(l) = this yields to M{2p) = ^{7p - 4^). D 



As an example, if we compute the product of two (2^° x 2^°) matrices (represented 
as double arrays) with iS we need 8 ■ ^(7^° — 4^°) bytes, i.e. 12.76 gigabytes of 
memory. That is an enormous amount of RAM for such a problem instance. Brice 
BOYER et al. ([3]) solved this problem with fully in-place schedules of Strassen- 
Winograd's algorithm (see the following paragraph), if the input matrices can be 
overwritten. 
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Shmuel Winograd optimized Strassen's algorithm. The Strassen-Wino- 
GRAD algorithm (described in [11]) needs only 15 additions and subtractions, whereas 
S needs 18. Winograd had also shown (see [H]), that the minimum number of 
multiplications required to multiply 2x2 matrices is 7. Furthermore, Robert 
Probert ([H]) showed that 15 additive operations are necessary and sufficient to 
multiply two 2x2 matrices with 7 multiplications. 

Because of the bad properties of S with full recursion and large matrices, one 
can study the idea to use only one step of recursion. If n is even and we use one step 
of recursion of S (for the remaining products we use Af) the ratio of this operation 
count to that required by JV is (see 9 ) 

7n^ + lln^ „^oo 7 
8n3 - 4n2 ^ 8 ' 

Therefore the multiplication of two sufficiently large matrices using S costs approx- 
imately 12.5 % less than using M. 

Using the technique of stopping the recursion in the Strassen- Winograd al- 
gorithm early, there are well known implementations, as for example 

• on the Cray-2 from David Bailey ([2]), 

• GEMMW from DOUGLAS et al. ([7]) and 

• a routine in the IBM ESSL library routine ([10]). 

1.3. The aim of this work. Strassen's algorithm can only be used for {m2'^ x 
m2'^) matrices. For arbitrary {n x n) matrices one has to add zero rows and columns 
to the given matrices (see the next section) to use Strassen's algorithm. Strassen 
gave a strategy of how to set m and k for arbitrary n to ensure n < m2^ . In this 
paper we study the number d of additional zero rows and columns and the influence 
on the number of flops used by the algorithm in the worst case, best case and in 
the average case. 

It is known (TT]), that these parameters are not optimal. We only study the 
number d and the additional work caused by the bad parameters of Strassen. 
We give no better strategy of how to set m and k, and we do not analyze other 
strategies than the one from Strassen. 

2. Strassen's parameter for matrices of arbitrary order 

Algorithm S uses recursions to multiply matrices of order m2^ . If fc = then S 
coincides with the naive algorithm J\f. So we will only consider the case where 
A: > 0. To use S for arbitrary (n x n) matrices A and B (that means for arbitrary 
n) we have to embed them into matrices A and B which arc both [fi x n) matrices 
with n := m2}^ ^ n. We do this by adding i := n — n zero rows and colums to A 
and B. This results in 



AB 



nlxn nt 



B 0"^' 

n£xn nixt 



c, 



where 0*^^^ denotes the (fc x j) zero matrix. If we delete the last i columns and 
rows of C we get the result C = AB. 

We now focus on how to find m and k for arbitrary n with n ^ m2''. An optimal 
but purely theoretical choice is 

(to*, k*) — argmin{i^5(m, k) : {m,k) G N x No,ri ^ m2^}. 
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Further methods of finding m and k can be found in [11] . We choose another way. 
According to Strassen's proof of the main result of [M], we define 

k ■- [log2?iJ - 4 and m := \n2^^\ + 1, (2.1) 

where [a; J denotes the largest integer not greater than x. We define h :— m2^ and 
study the relationship between n and fi. The results are: 

• worst case: n ^ (17/16)n, 

• best case: n ^ n + I and 

• average case: h ~ (49/48)n. 



2.1. Worst case analysis. 

Theorem 2.1. Let n e N with n > 16. For the parameters (12. ip and m2^ 
have 

17 

16 



n we 



If n is a power of two, we have n 



16' 



Proof. For fixed n there is exactly one a g N with 2" < n < 2"+-'^. We define 
/« := {2", . . . , 2"+i - 1}. Because of ^^ for each n e /" the value of k is 



k = [log2 nj 

Possible values for m are 

1 



log2 2" 



a 



4. 



>a — 4 



1 = 



16 

I — 
2" 



+ 1 =: m[n). 



m{n) is increasing in n and r7i(2") — 17 and 771(2"+^) = 33. Therefore we have 
m G {17, . . . , 32}. For each n G /" one of the following inequalities holds: 



m 


2" 


== 16 • 2"-4 


< 


n 


< 


n ■2°'-^ 


im 




17 . 2"^* 


< 


n 


< 


18 • 2"^^ 



{Ift 



167 



31-2 



a — 4 



< n < 32-2 



a— 4 



2^+1 



Note that /" — l+J.^j^ If- It follows, that for all n S N there exists exactly one a 
with n ^ I" and for all n ^ I" there is exactly one j with n ^ I". 

Note that for all n e /" we have k — a ~ 4 and m{n) = j + 16. If we only focus 
on /" the difference h — n has its maximum at the lower end of /" (n is constant 
and n has its minimum at the lower end of /"). On /" the value of fi and the 
minimum of n are 



n- 



(16 + j)-2° 



Therefore the difference d" := n" 



and 



n" is constant: 



(15+j)-2° 



d^" = (16 + j)-2"-''-(15 + j)-2' 
To set this in relation with n we study 



a — 4 rja — 4 



for all j. 



df 



df 
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2 X 10^ 



1 X 10^ 



Fsi2^,l0-j) 







6 7 



10 



Figure 1. Different parameters (m — 2-', k = 10 - j, n = m2'') 
to apply in the Strassen algorithm for matrices of order 2^°. 



Finally r" is maximal, iff n" is minimal, which is the case for n" = 16 • 2" =2". 
With rf — 17/16 we get h < j^n, which completes the proof. D 

Now we want to use the result above to take a look at the number of flops we 
need for S in the worst case. The worst case is n = 2^ for any 4 < p G N. An 
optimal decomposition (in the sense of minimizing the number (n — n) of zero rows 
and columns we add to the given matrices) is ?Ti = 2-' and k — p ^ j, because 
7712*^ = 2^2^^^ = 2P = n. Note that these parameters m and k have nothing to do 
with equation (j2.1[) . Lets have a look on the influence of j : 

Lemma. Let n — 2^. In the decomposition n — m2^ we use m = 2^ and k — p — j. 
Then f{j) :— Fs{2'-> ,p — j) has its minimum at j — S. 

Proof. We have f{j) = 2 • 7^(8/7)^ + 5 • 7^(4/7)^' - 4^6. Thus 

fij + 1) - fU) 

= [2 • 7P(8/7)J'+i + 5 • 7P(4/7)^'+i - 4^6] - [2 • 7^(8/7)^' + 5 • 7^(4/7)^' - 4^6] 
= 2 ■ 7^(8/7)^(8/7 - 1) + 5 • 7^(4/7)^X4/7 - 1) 
= 2 ■ 7P(8/7)^' ■ 1/7 - 15 • 7^(4/7)^' • 1/7 = 2(4/7)^7^-^ (2^' - 7.5). 
Therefore, f{j) is a minimum if j = min{i : 2* — 7.5 > 0} = 3. D 

Figure [1] shows that it is not optimal to use S with full recursion in the example 
p = 10. Now we study the worst case n = 2^ and different sets of parameters m 
and k for Fs(m, k): 

(1) If we use equation (|2.ip . we get the original parameters of Strassen: k — 
p — 4 and to = 17. Therefore we define 



F,ip):=Fsil7,p-A) = 7P 



39 ■ 172 



4P 



6-172 



74 - 44 

(2) Obviously m = 16 would be a better choice, because with this we get 
TO,2'^ = n (we avoid the additional zero rows and columns). Now we define 

,37-162 



F2ip):=Fsil6,p-4) = 7P- 



74 



4P6. 
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Figure 2. Comparison of Strassen's parameters (/i, m = 17, 
k — p ~ A), obviously better parameters (/2, m = 16, k = p — A) 
and the optimal parameters (lemma, ?7i = 8, fc = p — 3) for the 
worst case n = 2^ . Limits of fj are dashed. 



(3) Finally we use the lemma above. With m = 8 = 2"^ and A; = p — 3 we get 
i^3(p):=F5(8,p-3) = 7^^-4P6. 

Now we analyze Fi relative to each other. Therefore we define /_,- := Fj/F^ 

which is monotonously decreasing in p. 

¥• ■ 7203 - 7P • 4736 



(j = l,2). So we have/,: {4,5,...}- 
With 

289(4P • 2401 - JP ■ 1664) 



hip) = 



12544(4P • 49 - 7P • 32) 



and f2{p) = 



147(4P • 49 - 7P • 32) 



we get 



/i(4) = 3179/2624 « 1.2115 
/2(4) = 124/123 « 1.00813 



lim fi{p) = 3757/3136 « 1.1980 
lim h{p) = 148/147 « 1.00680. 



Figure [5] shows the graphs of the functions fj. In conclusion, in the worst case the 
parameters of Strassen need approx. 20 % more flops than the optimal parameters 
of the lemma. 

2.2. Best case analysis. 

Theorem 2.2. Let n G N with n ^ 16. For the parameters (|2.ip and to2'^ = n we 
have 



n + 1 < h. 
Ifn = 2Pi-l, £e {16,..., 31}, we have n = r 



1. 



Proof. Like in Theorem 12.11 we have for each /" a constant value for n^ namely 
2"~^(16 + j). Therefore n < n holds. The difference n — n has its minimum at the 



upper end of /". There we have n 
This shows n + 1 < h. 



Hl6+j)-(2"-4(i6 + j-)_l) = l. 

D 



Let us focus on the flops we need for S, again. Lets have a look at the example 
n = 2^ — 1. The original parameters (see equation (|2.ip ) for S are k = p — 5 and 
TO — 32. Accordingly we define F{p) := ^^(32,^ — 5). Because 2^ — 1 < 2*^ we can 
add 1 zero row and column and use the lemma from the worst case. Now we get 
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the parameters m = 8 and k ~ p ~ 3 and define F[p) := Fs{8,p — 3). To analyze 
F and F relative to each other we have 

_ F{p) _ 4P ■ 16807 - 7P ■ 11776 
^ '~ F(p) ~ 4P • 16807 -7P- 10976' 

Note that r : {5, 6, . . .} ^ R is monotonously decreasing in p and has its maximum 
at p = 5. We get 

r(5) = 336/311 « 1.08039 lim r{p) = 11776/10976 w 1.07289. 

Therefore we can say: In the best case the parameters of Strassen are approx. 
8 % worse than the optimal parameters from the lemma in the worst case. 

2.3. Average case analysis. With En we denote the expected value of n. We 
search for a relationship like h k, jn for 7 G R. That means E[n/n] = 7. 

Theorem 2.3. For the parameters (|2.1|) of Strassen to2'^ — fi we have 

^~ 49 
En = — n. 
48 

Proof. First we focus only on /". We write E" := E|/c, and E" := E|/a for the 
expected value on /" and /", resp. We have 

,16 1 ^^ 1 ^^ 

j=i j=i j=i 

Together with E"n = i[(2"+i - 1) + 2"] = 2" + 2"-^ - 1/2, we get 

E"n 2"-5 • 49 



p{a) := E" 



E"n 2" + 2"-i-l/2 



Now we want to calculate E^ :— E|[/(fc-) [n/n], where U{k) := l+J.^p ^"'^"' ^y using the 
values p(j). Because of |/5| = 2\I^\ and j/^'U/^l = 3|/4| we have Ei = ip(4) + |p(5). 
With the same argument we get 

4+fc oj-4 

Efe = E^J^(-^') '^^'^''^ ^J' " 2fc+i-l ' 

J=4 



Finally we have 

E 



= lim Efc= lim ( f]/3,p(j)) 
\ i=4 / 

- lim ( ^^ V 2^J-9 \ _ 49 

~ fe^i 1^ 2*^+1 - 1 ^ 2J + 2J-1 - l/2y ~ 48' 

what we intended to show. D 

Compared to the worst case (n < jln, 17/16 = 1 + 1/16), note that 49/48 = 
l + l/48 = l + i-3^. 
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3. Conclusion 

Strassen used the parameters m and k in the form (j2.ip to show that his matrix 
muhiphcation algorithm needs less than 4.7ri'°^2 7 flops. We could show in this 
paper, that these parameters cause an additional work of approx. 20 % in the worst 
case in comparison to the optimal strategy for the worst case. This is the main 
reason for the search for better parameters, like in [H]. 
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